This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

You are working for a telecommunications company on their customer analytics team. Your manager wants you to analyze customer data and provide insights for retaining existing customers. Retention is a central part of customer relationship management: the cost of attracting a new customer is much higher than the cost of retaining an existing one. As part of the class project, your goal is to analyze customer data and identify customers who are likely to leave, so that you can act to retain them by offering incentives. The data is spread over two .rda files, Model_Building_Data and Implementation_Data. Model_Building_Data should be used to build the model; it contains current customers as well as customers who have left. A description of the columns in this dataset is given below.

1. Data preparation and Exploratory Data Analysis

# Load the dataset
load("Model_Building_Data.rda")
# Load required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Use glimpse to view variable types and preview values
glimpse(Model_Building_Data)
## Rows: 6,499
## Columns: 17
## $ Gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ Tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status           <chr> "Current", "Current", "Left", "Current", "Left", "Lef…

We are converting Status and SeniorCitizen to factors because both are categorical variables — Status is the target label for classification, and SeniorCitizen is a binary (0/1) feature that should be treated as a category, not a numeric scale.

# Convert the target variable 'Status' from character to factor
# This allows classification models to treat it as a category 
Model_Building_Data$Status <- as.factor(Model_Building_Data$Status)

# Convert 'SeniorCitizen' from integer (0/1) to factor
# 0 = Not a senior, 1 = Senior — it's a binary category, not a numeric trend
# Treating it as a factor helps models understand it's a yes/no type variable
Model_Building_Data$SeniorCitizen <- as.factor(Model_Building_Data$SeniorCitizen)

# Check the updated structure of the dataset after conversions
glimpse(Model_Building_Data)
## Rows: 6,499
## Columns: 17
## $ Gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ Tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes, …
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No …
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in…
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte…
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status           <fct> Current, Current, Left, Current, Left, Left, Current,…
# Check total missing values column-wise
# This helps identify which variables (if any) have NA values
colSums(is.na(Model_Building_Data))
##           Gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           Tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity DeviceProtection  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##           Status 
##                0
# Remove rows with missing values (11 rows have missing TotalCharges)
# This is a safe choice since only a small number of rows are affected
Model_Building_Data <- na.omit(Model_Building_Data)

# Re-check to confirm missing values are gone
colSums(is.na(Model_Building_Data))
##           Gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           Tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity DeviceProtection  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0                0 
##           Status 
##                0

We removed the 11 rows with missing values in TotalCharges using na.omit() to ensure the dataset is clean and ready for modeling. Since only a small number of rows are affected (11 of 6,499), we have chosen to remove them rather than impute values.

# Summary statistics for numerical variables
summary(Model_Building_Data[, c("Tenure", "MonthlyCharges", "TotalCharges")])
##      Tenure      MonthlyCharges    TotalCharges   
##  Min.   : 1.00   Min.   : 18.25   Min.   :  18.8  
##  1st Qu.: 9.00   1st Qu.: 35.29   1st Qu.: 396.2  
##  Median :29.00   Median : 70.35   Median :1389.7  
##  Mean   :32.31   Mean   : 64.70   Mean   :2273.5  
##  3rd Qu.:55.00   3rd Qu.: 89.90   3rd Qu.:3773.4  
##  Max.   :72.00   Max.   :118.75   Max.   :8684.8
# Standard deviation for numeric variables
sapply(Model_Building_Data[, c("Tenure", "MonthlyCharges", "TotalCharges")], sd)
##         Tenure MonthlyCharges   TotalCharges 
##       24.56461       30.18221     2268.75008
# Frequency tables for a few categorical variables
table(Model_Building_Data$Contract)
## 
## Month-to-month       One year       Two year 
##           3581           1351           1556
table(Model_Building_Data$PaymentMethod)
## 
## Bank transfer (automatic)   Credit card (automatic)          Electronic check 
##                      1423                      1393                      2171 
##              Mailed check 
##                      1501
table(Model_Building_Data$InternetService)
## 
##         DSL Fiber optic          No 
##        2216        2855        1417
# Load ggplot2 for plotting
library(ggplot2)
# Plot 1: Gender Distribution
# This bar plot shows how many male and female customers are in the dataset.
ggplot(Model_Building_Data, aes(x = Gender)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Distribution of Gender", x = "Gender", y = "Count") +
  theme_minimal()

The dataset has a nearly equal number of male and female customers.

# Distribution of customer status (Current or Left)
ggplot(Model_Building_Data, aes(x = Status)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Customer Status Distribution", x = "Status", y = "Count") +
  theme_minimal()

# Distribution of Senior Citizen status (0 = Not a senior, 1 = Senior)
ggplot(Model_Building_Data, aes(x = SeniorCitizen)) +
  geom_bar(fill = "orange") +
  labs(title = "Senior Citizen Status", x = "Senior Citizen (0 = No, 1 = Yes)", y = "Count") +
  theme_minimal()

# Distribution of Partner status (Yes or No)
ggplot(Model_Building_Data, aes(x = Partner)) +
  geom_bar(fill = "mediumorchid") +
  labs(title = "Partner Status", x = "Has Partner", y = "Count") +
  theme_minimal()

# Distribution of Dependents (Yes or No)
ggplot(Model_Building_Data, aes(x = Dependents)) +
  geom_bar(fill = "darkorange") +
  labs(title = "Dependents Status", x = "Has Dependents", y = "Count") +
  theme_minimal()

# Distribution of tenure
# This histogram shows how long customers have stayed with the company (in months).
ggplot(Model_Building_Data, aes(x = Tenure)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Customer Tenure", x = "Tenure (Months)", y = "Count") +
  theme_minimal()

# Distribution of Phone Service
# This bar plot shows how many customers have phone service versus those who do not.
ggplot(Model_Building_Data, aes(x = PhoneService)) +
  geom_bar(fill = "mediumseagreen") +
  labs(title = "Distribution of Phone Service", x = "Phone Service", y = "Count") +
  theme_minimal()

# Distribution of Multiple Lines
# This bar plot shows how many customers have multiple phone lines, do not have them, or have no phone service.
ggplot(Model_Building_Data, aes(x = MultipleLines)) +
  geom_bar(fill = "darkgoldenrod1") +
  labs(title = "Distribution of Multiple Lines", x = "Multiple Lines", y = "Count") +
  theme_minimal()

# Distribution of Internet Service
# This bar plot shows the types of internet services customers are subscribed to.
ggplot(Model_Building_Data, aes(x = InternetService)) +
  geom_bar(fill = "cornflowerblue") +
  labs(title = "Distribution of Internet Service", x = "Internet Service Type", y = "Count") +
  theme_minimal()

# Distribution of Online Security
# This bar plot shows how many customers have online security, don't have it, or have no internet service.
ggplot(Model_Building_Data, aes(x = OnlineSecurity)) +
  geom_bar(fill = "indianred2") +
  labs(title = "Distribution of Online Security", x = "Online Security", y = "Count") +
  theme_minimal()

# Distribution of Device Protection
# This bar plot shows how many customers have device protection, do not have it, or have no internet service.
ggplot(Model_Building_Data, aes(x = DeviceProtection)) +
  geom_bar(fill = "deepskyblue3") +
  labs(title = "Distribution of Device Protection", x = "Device Protection", y = "Count") +
  theme_minimal()

# Distribution of Contract Types
# This bar plot shows the number of customers based on their contract length.
ggplot(Model_Building_Data, aes(x = Contract)) +
  geom_bar(fill = "darkslateblue") +
  labs(title = "Distribution of Contract Type", x = "Contract Type", y = "Count") +
  theme_minimal()

# Distribution of Paperless Billing
# This bar plot shows how many customers have enabled paperless billing.
ggplot(Model_Building_Data, aes(x = PaperlessBilling)) +
  geom_bar(fill = "tomato") +
  labs(title = "Distribution of Paperless Billing", x = "Paperless Billing", y = "Count") +
  theme_minimal()

# Distribution of Payment Methods
# This bar plot shows the number of customers using each payment method.
ggplot(Model_Building_Data, aes(x = PaymentMethod)) +
  geom_bar(fill = "goldenrod") +
  labs(title = "Distribution of Payment Methods", x = "Payment Method", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

# Distribution of Monthly Charges
# This histogram shows how much customers are charged monthly.
ggplot(Model_Building_Data, aes(x = MonthlyCharges)) +
  geom_histogram(binwidth = 5, fill = "mediumturquoise", color = "white") +
  labs(title = "Distribution of Monthly Charges", x = "Monthly Charges ($)", y = "Count") +
  theme_minimal()

# Distribution of Total Charges
# This histogram shows the total amount charged to customers over their entire tenure.
ggplot(Model_Building_Data, aes(x = TotalCharges)) +
  geom_histogram(binwidth = 250, fill = "mediumpurple", color = "white") +
  labs(title = "Distribution of Total Charges", x = "Total Charges ($)", y = "Count") +
  theme_minimal()

Status = “Left” means the customer has churned; Status = “Current” means the customer is still with the company.
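With this encoding, the overall churn rate is simply the proportion of “Left” labels. A minimal sketch on a toy Status vector (illustrative values, not the actual data):

```r
# Toy Status vector standing in for Model_Building_Data$Status (illustrative only)
status <- factor(c("Current", "Current", "Left", "Current", "Left"),
                 levels = c("Current", "Left"))

# Churn rate = proportion of customers labeled "Left"
churn_rate <- mean(status == "Left")
churn_rate  # 0.4 for this toy vector

# Equivalently, as a proportion table over both levels
prop.table(table(status))
```

On the real data, replacing `status` with `Model_Building_Data$Status` gives the class balance that the bar plot below visualizes.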

# Plot: Distribution of Status (Current vs Left)
ggplot(Model_Building_Data, aes(x = Status)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Customer Churn Distribution",
       x = "Customer Status", y = "Count") +
  theme_minimal()

# Plot: Churn distribution by contract type
ggplot(Model_Building_Data, aes(x = Contract, fill = Status)) +
  geom_bar(position = "fill") +  # This gives proportional stacked bars
  labs(title = "Churn Rate by Contract Type",
       x = "Contract Type", y = "Proportion") +
  theme_minimal()

# Plot: Churn rate by OnlineSecurity status
ggplot(Model_Building_Data, aes(x = OnlineSecurity, fill = Status)) +
  geom_bar(position = "fill") +  # Proportional stacked bars
  labs(title = "Churn Rate by Online Security",
       x = "Online Security Service", y = "Proportion") +
  theme_minimal()

# Plot: Churn rate by customer tenure
ggplot(Model_Building_Data, aes(x = Tenure, fill = Status)) +
  geom_histogram(binwidth = 5, position = "fill") +  # stacked by proportion
  labs(title = "Churn Rate by Customer Tenure",
       x = "Tenure (Months)", y = "Proportion") +
  theme_minimal()

# Plot: Churn rate by senior citizen status
ggplot(Model_Building_Data, aes(x = SeniorCitizen, fill = Status)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Senior Citizen Status",
       x = "Senior Citizen (0 = No, 1 = Yes)", y = "Proportion") +
  theme_minimal()

# Plot: Churn rate by type of Internet Service
ggplot(Model_Building_Data, aes(x = InternetService, fill = Status)) +
  geom_bar(position = "fill") +  # Shows proportion of churn per service type
  labs(title = "Churn Rate by Internet Service Type",
       x = "Internet Service", y = "Proportion") +
  theme_minimal()

# Set seed for reproducibility
set.seed(123)

# Generate sample row indices for training (80% of the data)
train_indices <- sample(nrow(Model_Building_Data), 0.8 * nrow(Model_Building_Data))

# Split the dataset
train_data <- Model_Building_Data[train_indices, ]
test_data  <- Model_Building_Data[-train_indices, ]

2. Logistic Regression

Model 1

# Logistic regression model using Status as the response variable
log_model_full <- glm(Status ~ SeniorCitizen + Partner + Dependents + Tenure + PhoneService +
                        MultipleLines + InternetService + OnlineSecurity + DeviceProtection +
                        Contract + PaperlessBilling + PaymentMethod + MonthlyCharges +
                        TotalCharges + Gender,
                      data = train_data,
                      family = "binomial")
summary(log_model_full)
## 
## Call:
## glm(formula = Status ~ SeniorCitizen + Partner + Dependents + 
##     Tenure + PhoneService + MultipleLines + InternetService + 
##     OnlineSecurity + DeviceProtection + Contract + PaperlessBilling + 
##     PaymentMethod + MonthlyCharges + TotalCharges + Gender, family = "binomial", 
##     data = train_data)
## 
## Coefficients: (3 not defined because of singularities)
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           8.363e-02  2.722e-01   0.307 0.758630    
## SeniorCitizen1                        2.554e-01  9.699e-02   2.634 0.008443 ** 
## PartnerYes                            1.543e-02  9.129e-02   0.169 0.865816    
## DependentsYes                        -2.181e-01  1.060e-01  -2.057 0.039660 *  
## Tenure                               -6.819e-02  7.395e-03  -9.221  < 2e-16 ***
## PhoneServiceYes                      -9.212e-01  1.804e-01  -5.106 3.30e-07 ***
## MultipleLinesNo phone service                NA         NA      NA       NA    
## MultipleLinesYes                      2.287e-01  9.612e-02   2.379 0.017374 *  
## InternetServiceFiber optic            6.972e-01  1.698e-01   4.105 4.04e-05 ***
## InternetServiceNo                    -4.297e-01  2.256e-01  -1.904 0.056887 .  
## OnlineSecurityNo internet service            NA         NA      NA       NA    
## OnlineSecurityYes                    -5.894e-01  1.017e-01  -5.793 6.92e-09 ***
## DeviceProtectionNo internet service          NA         NA      NA       NA    
## DeviceProtectionYes                  -1.862e-01  9.982e-02  -1.865 0.062138 .  
## ContractOne year                     -6.593e-01  1.277e-01  -5.165 2.41e-07 ***
## ContractTwo year                     -1.306e+00  1.987e-01  -6.571 4.98e-11 ***
## PaperlessBillingYes                   2.766e-01  8.621e-02   3.209 0.001334 ** 
## PaymentMethodCredit card (automatic) -1.345e-01  1.347e-01  -0.999 0.317864    
## PaymentMethodElectronic check         4.146e-01  1.106e-01   3.749 0.000177 ***
## PaymentMethodMailed check            -8.033e-02  1.350e-01  -0.595 0.551723    
## MonthlyCharges                        6.082e-03  5.336e-03   1.140 0.254335    
## TotalCharges                          4.083e-04  8.289e-05   4.926 8.38e-07 ***
## GenderMale                           -1.404e-02  7.556e-02  -0.186 0.852555    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6019.5  on 5189  degrees of freedom
## Residual deviance: 4292.1  on 5170  degrees of freedom
## AIC: 4332.1
## 
## Number of Fisher Scoring iterations: 6

Significant factors associated with customers discontinuing service (p < 0.05): SeniorCitizen, Dependents, Tenure, PhoneService, MultipleLines (Yes), InternetService (Fiber optic), OnlineSecurity (Yes), Contract (One year and Two year), PaperlessBilling, PaymentMethod (Electronic check), and TotalCharges.

Factors that were not statistically significant at the 0.05 level: Partner, Gender, MonthlyCharges, DeviceProtection, InternetService (No), and the remaining PaymentMethod levels.

Note: Some factor levels (e.g., “No phone service” under MultipleLines) were automatically excluded by R because they are perfectly collinear with other predictors: a customer has “No phone service” exactly when PhoneService is “No”, so the corresponding dummy variable carries no new information. These exclusions are expected and do not affect the model's fit or predictions.
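The aliasing can be reproduced on a small synthetic example (toy data, not the project dataset): when one factor's level is fully determined by another predictor, glm() reports an NA coefficient for the redundant dummy, just as in the full model above.

```r
# Toy example showing why R reports NA coefficients under perfect collinearity:
# 'line_type' is "No phone service" exactly when phone == "No", so that dummy
# is a linear combination of the intercept and the phone dummy.
set.seed(1)
phone <- factor(sample(c("No", "Yes"), 200, replace = TRUE))
line_type <- factor(ifelse(phone == "No", "No phone service",
                           sample(c("Single", "Multiple"), 200, replace = TRUE)))
y <- rbinom(200, 1, 0.3)

fit <- glm(y ~ phone + line_type, family = "binomial")
coef(fit)  # the "No phone service" coefficient comes back NA (aliased)
```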

# Predict probabilities from the full model
predicted_probs <- predict(log_model_full, newdata = test_data, type = "response")
# Load the pROC package
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Predict probabilities on the test set
predicted_probs <- predict(log_model_full, newdata = test_data, type = "response")

# Generate ROC curve and calculate AUC
roc_obj <- roc(test_data$Status, predicted_probs)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
# Print AUC
auc(roc_obj)
## Area under the curve: 0.8446
# Plot ROC curve
plot(roc_obj, col = "blue", main = "ROC Curve - Full Logistic Regression Model")

Model 2

# Model 2
log_model_reduced <- glm(Status ~ SeniorCitizen + Tenure + PhoneService + MultipleLines +
                          InternetService + OnlineSecurity + Contract + PaperlessBilling +
                          PaymentMethod + MonthlyCharges + TotalCharges,
                        data = train_data,
                        family = "binomial")

summary(log_model_reduced)
## 
## Call:
## glm(formula = Status ~ SeniorCitizen + Tenure + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + Contract + 
##     PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges, 
##     family = "binomial", data = train_data)
## 
## Coefficients: (2 not defined because of singularities)
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           1.245e-01  2.644e-01   0.471 0.637643    
## SeniorCitizen1                        2.907e-01  9.522e-02   3.053 0.002265 ** 
## Tenure                               -6.932e-02  7.392e-03  -9.378  < 2e-16 ***
## PhoneServiceYes                      -8.410e-01  1.744e-01  -4.822 1.42e-06 ***
## MultipleLinesNo phone service                NA         NA      NA       NA    
## MultipleLinesYes                      2.521e-01  9.512e-02   2.650 0.008046 ** 
## InternetServiceFiber optic            8.072e-01  1.615e-01   4.997 5.81e-07 ***
## InternetServiceNo                    -5.271e-01  2.205e-01  -2.391 0.016811 *  
## OnlineSecurityNo internet service            NA         NA      NA       NA    
## OnlineSecurityYes                    -5.711e-01  1.007e-01  -5.669 1.44e-08 ***
## ContractOne year                     -6.808e-01  1.273e-01  -5.349 8.87e-08 ***
## ContractTwo year                     -1.349e+00  1.979e-01  -6.816 9.37e-12 ***
## PaperlessBillingYes                   2.849e-01  8.602e-02   3.312 0.000926 ***
## PaymentMethodCredit card (automatic) -1.305e-01  1.345e-01  -0.971 0.331685    
## PaymentMethodElectronic check         4.293e-01  1.104e-01   3.890 0.000100 ***
## PaymentMethodMailed check            -7.842e-02  1.347e-01  -0.582 0.560524    
## MonthlyCharges                        1.968e-03  4.904e-03   0.401 0.688137    
## TotalCharges                          4.154e-04  8.306e-05   5.001 5.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6019.5  on 5189  degrees of freedom
## Residual deviance: 4300.5  on 5174  degrees of freedom
## AIC: 4332.5
## 
## Number of Fisher Scoring iterations: 6
# Predict probabilities
reduced_probs <- predict(log_model_reduced, newdata = test_data, type = "response")

# AUC with pROC
library(pROC)
roc_reduced <- roc(test_data$Status, reduced_probs)
## Setting levels: control = Current, case = Left
## Setting direction: controls < cases
auc(roc_reduced)
## Area under the curve: 0.8465
# Plot ROC
plot(roc_reduced, col = "darkgreen", main = "ROC Curve - Reduced Model")

After comparing the two logistic regression models, the reduced model achieved the higher AUC (0.8465 vs. 0.8446 for the full model), so we will proceed with it.

# Step 1: Predict probabilities from logistic regression model
log_probs <- predict(log_model_reduced, newdata = test_data, type = "response")
# Threshold = 0.4
log_preds_th0.4 <- ifelse(log_probs > 0.4, "Left", "Current")
cat("Confusion Matrix – Threshold 0.4:\n")
## Confusion Matrix – Threshold 0.4:
print(table(log_preds_th0.4, test_data$Status))
##                
## log_preds_th0.4 Current Left
##         Current     813  122
##         Left        138  225
cat("Prediction Accuracy (0.4):", mean(log_preds_th0.4 == test_data$Status), "\n\n")
## Prediction Accuracy (0.4): 0.7996918
# Threshold = 0.5
log_preds_th0.5 <- ifelse(log_probs > 0.5, "Left", "Current")
cat("Confusion Matrix – Threshold 0.5:\n")
## Confusion Matrix – Threshold 0.5:
print(table(log_preds_th0.5, test_data$Status))
##                
## log_preds_th0.5 Current Left
##         Current     862  167
##         Left         89  180
cat("Prediction Accuracy (0.5):", mean(log_preds_th0.5 == test_data$Status), "\n\n")
## Prediction Accuracy (0.5): 0.8027735
# Threshold = 0.6
log_preds_th0.6 <- ifelse(log_probs > 0.6, "Left", "Current")
cat("Confusion Matrix – Threshold 0.6:\n")
## Confusion Matrix – Threshold 0.6:
print(table(log_preds_th0.6, test_data$Status))
##                
## log_preds_th0.6 Current Left
##         Current     907  225
##         Left         44  122
cat("Prediction Accuracy (0.6):", mean(log_preds_th0.6 == test_data$Status), "\n\n")
## Prediction Accuracy (0.6): 0.7927581
# Create summary data frame
logit_threshold_df <- data.frame(
  Threshold = c(0.4, 0.5, 0.6),
  Accuracy = c(0.7996918, 0.8027735, 0.7927581)
)

# View the table
print(logit_threshold_df)
##   Threshold  Accuracy
## 1       0.4 0.7996918
## 2       0.5 0.8027735
## 3       0.6 0.7927581
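
The three copy-pasted threshold blocks above can be collapsed into a single sweep. A sketch with toy probabilities and labels (hypothetical values, not the actual test set):

```r
# Toy predicted probabilities and true labels (illustrative only)
probs  <- c(0.10, 0.35, 0.45, 0.55, 0.65, 0.90)
actual <- c("Current", "Current", "Left", "Current", "Left", "Left")

# Accuracy at a cutoff: predict "Left" when the probability exceeds the threshold
accuracy_at <- function(th) mean(ifelse(probs > th, "Left", "Current") == actual)

thresholds <- c(0.4, 0.5, 0.6)
data.frame(Threshold = thresholds, Accuracy = sapply(thresholds, accuracy_at))
```

Substituting `log_probs` and `test_data$Status` reproduces the table above; the same pattern also extends easily to a finer threshold grid such as `seq(0.1, 0.9, by = 0.05)`.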

3. Naive Bayes

# Load the package to access Naive Bayes
library(e1071)
# ggplot2 is already attached via tidyverse, but loading it again is harmless
library(ggplot2)

# Histogram for Tenure
ggplot(Model_Building_Data) +
  geom_histogram(aes(x = Tenure), bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Tenure", x = "Tenure (Months)", y = "Count")

# Histogram for MonthlyCharges
ggplot(Model_Building_Data) +
  geom_histogram(aes(x = MonthlyCharges), bins = 30, fill = "darkorange", color = "white") +
  labs(title = "Distribution of Monthly Charges", x = "Monthly Charges ($)", y = "Count")

# Histogram for TotalCharges
ggplot(Model_Building_Data) +
  geom_histogram(aes(x = TotalCharges), bins = 30, fill = "seagreen", color = "white") +
  labs(title = "Distribution of Total Charges", x = "Total Charges ($)", y = "Count")

  1. Tenure: the histogram is U-shaped (many customers with low tenure, and a spike again around 70 months). Custom bins used: Short < 20, Medium = 20 to 50, Long > 50. This is a good binning choice: it separates early, mid-tenure, and loyal customers.
  2. MonthlyCharges: right-skewed, with a heavy concentration around $20 to $40. Custom bins used: Low < 35, Medium = 35 to 85, High > 85. This is logical: it captures discount/basic plans, the standard range, and premium/high-bill customers.
  3. TotalCharges: strong right skew, with many low values and few high-value customers. Custom bins used: Low < 1500, Medium = 1500 to 5000, High > 5000. Well matched: it separates low-lifetime-value customers from the medium and high tiers.
# create binned variables after train-test split
train_data <- train_data %>%
  mutate(
    Tenure_binned = ifelse(Tenure < 20, "Short",
                     ifelse(Tenure > 50, "Long", "Medium")),
    MonthlyCharges_binned = ifelse(MonthlyCharges < 35, "Low",
                             ifelse(MonthlyCharges > 85, "High", "Medium")),
    TotalCharges_binned = ifelse(TotalCharges < 1500, "Low",
                           ifelse(TotalCharges > 5000, "High", "Medium"))
  ) %>%
  mutate(
    Tenure_binned = as.factor(Tenure_binned),
    MonthlyCharges_binned = as.factor(MonthlyCharges_binned),
    TotalCharges_binned = as.factor(TotalCharges_binned)
  )

test_data <- test_data %>%
  mutate(
    Tenure_binned = ifelse(Tenure < 20, "Short",
                     ifelse(Tenure > 50, "Long", "Medium")),
    MonthlyCharges_binned = ifelse(MonthlyCharges < 35, "Low",
                             ifelse(MonthlyCharges > 85, "High", "Medium")),
    TotalCharges_binned = ifelse(TotalCharges < 1500, "Low",
                           ifelse(TotalCharges > 5000, "High", "Medium"))
  ) %>%
  mutate(
    Tenure_binned = as.factor(Tenure_binned),
    MonthlyCharges_binned = as.factor(MonthlyCharges_binned),
    TotalCharges_binned = as.factor(TotalCharges_binned)
  )
str(Model_Building_Data)
## 'data.frame':    6488 obs. of  17 variables:
##  $ Gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ Tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Status          : Factor w/ 2 levels "Current","Left": 1 1 2 1 2 2 1 1 2 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 459 700 873 1007 1249 3090 3546 4067 4844 6196 ...
##   ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...

Model N1

# Full Naive Bayes model with all relevant variables
nb_model_full <- naiveBayes(Status ~ SeniorCitizen + Partner + Dependents + PhoneService +
                            MultipleLines + InternetService + OnlineSecurity + DeviceProtection +
                            StreamingMovies + Contract + PaperlessBilling + PaymentMethod + Gender +
                            Tenure_binned + MonthlyCharges_binned + TotalCharges_binned,
                            data = train_data,
                            laplace = 0.01)

# View summary
nb_model_full
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   Current      Left 
## 0.7333333 0.2666667 
## 
## Conditional probabilities:
##          SeniorCitizen
## Y                 0         1
##   Current 0.8707285 0.1292715
##   Left    0.7377133 0.2622867
## 
##          Partner
## Y                No       Yes
##   Current 0.4753023 0.5246977
##   Left    0.6473967 0.3526033
## 
##          Dependents
## Y                No       Yes
##   Current 0.6563313 0.3436687
##   Left    0.8345327 0.1654673
## 
##          PhoneService
## Y                 No        Yes
##   Current 0.09695430 0.90304570
##   Left    0.08960131 0.91039869
## 
##          MultipleLines
## Y                 No No phone service        Yes
##   Current 0.49474387       0.09695404 0.40830209
##   Left    0.45953484       0.08960066 0.45086450
## 
##          InternetService
## Y                DSL Fiber optic         No
##   Current 0.37283206  0.35128730 0.27588064
##   Left    0.23266114  0.70519425 0.06214461
## 
##          OnlineSecurity
## Y                 No No internet service        Yes
##   Current 0.39174941          0.27588064 0.33236995
##   Left    0.78900746          0.06214461 0.14884793
## 
##          DeviceProtection
## Y                 No No internet service        Yes
##   Current 0.36100872          0.27588064 0.36311064
##   Left    0.65822995          0.06214461 0.27962544
## 
##          StreamingMovies
## Y                 No No internet service        Yes
##   Current 0.34997359          0.27588064 0.37414576
##   Left    0.50433155          0.06214461 0.43352384
## 
##          Contract
## Y         Month-to-month   One year   Two year
##   Current     0.43326248 0.24855558 0.31818194
##   Left        0.88799376 0.08382044 0.02818581
## 
##          PaperlessBilling
## Y                No       Yes
##   Current 0.4611142 0.5388858
##   Left    0.2608416 0.7391584
## 
##          PaymentMethod
## Y         Bank transfer (automatic) Credit card (automatic) Electronic check
##   Current                 0.2451393               0.2490804        0.2482922
##   Left                    0.1315063               0.1170559        0.5895856
##          PaymentMethod
## Y         Mailed check
##   Current    0.2574881
##   Left       0.1618523
## 
##          Gender
## Y            Female      Male
##   Current 0.4926432 0.5073568
##   Left    0.5072253 0.4927747
## 
##          Tenure_binned
## Y              Long    Medium     Short
##   Current 0.3691537 0.3258014 0.3050449
##   Left    0.1011611 0.2290485 0.6697904
## 
##          MonthlyCharges_binned
## Y              High       Low    Medium
##   Current 0.2777198 0.2992646 0.4230156
##   Left    0.3995650 0.1033287 0.4971063
## 
##          TotalCharges_binned
## Y               High        Low     Medium
##   Current 0.18444679 0.46952073 0.34603248
##   Left    0.08237538 0.67268051 0.24494411
library(pROC)

# Predict probabilities for "Left"
nb_probs_full <- predict(nb_model_full, newdata = test_data, type = "raw")[, "Left"]

# Compute ROC object
nb_roc_full <- roc(response = test_data$Status,
                   predictor = nb_probs_full,
                   levels = c("Current", "Left"),
                   direction = "<")

# Plot the ROC curve
plot(nb_roc_full, main = "Naive Bayes – Full Model ROC Curve", col = "darkred", lwd = 3)

# Show AUC value
auc(nb_roc_full)
## Area under the curve: 0.8239
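The `auc()` value has a useful interpretation: it equals the probability that a randomly chosen customer who left receives a higher predicted probability than a randomly chosen current customer. A self-contained base-R sketch of this equivalence on simulated scores (the `score_left` and `score_current` vectors are made-up stand-ins, not the churn data; no pROC required):

```r
set.seed(42)
# Simulated scores: customers who left tend to score higher than stayers
score_left    <- rnorm(300, mean = 1)   # made-up "Left" scores
score_current <- rnorm(700, mean = 0)   # made-up "Current" scores

# AUC via the Mann-Whitney rank statistic on the pooled scores
scores <- c(score_current, score_left)
labels <- c(rep(0, 700), rep(1, 300))
r <- rank(scores)
auc_rank <- (sum(r[labels == 1]) - 300 * 301 / 2) / (300 * 700)

# Direct pairwise estimate of P(score for a leaver > score for a stayer)
auc_pairs <- mean(outer(score_left, score_current, ">"))

# The two agree up to floating point
round(c(rank = auc_rank, pairs = auc_pairs), 4)
```

This is why an AUC of 0.82 is meaningful even before picking any classification threshold.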

Model N2

# Train Naive Bayes model 
model_r <- naiveBayes(Status ~ Tenure_binned + MonthlyCharges_binned + TotalCharges_binned +
                      Contract + InternetService + OnlineSecurity + MultipleLines +
                      PhoneService + PaperlessBilling + PaymentMethod +
                      SeniorCitizen + StreamingMovies + Partner,
                      data = train_data, laplace = 0.01)


model_r
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   Current      Left 
## 0.7333333 0.2666667 
## 
## Conditional probabilities:
##          Tenure_binned
## Y              Long    Medium     Short
##   Current 0.3691537 0.3258014 0.3050449
##   Left    0.1011611 0.2290485 0.6697904
## 
##          MonthlyCharges_binned
## Y              High       Low    Medium
##   Current 0.2777198 0.2992646 0.4230156
##   Left    0.3995650 0.1033287 0.4971063
## 
##          TotalCharges_binned
## Y               High        Low     Medium
##   Current 0.18444679 0.46952073 0.34603248
##   Left    0.08237538 0.67268051 0.24494411
## 
##          Contract
## Y         Month-to-month   One year   Two year
##   Current     0.43326248 0.24855558 0.31818194
##   Left        0.88799376 0.08382044 0.02818581
## 
##          InternetService
## Y                DSL Fiber optic         No
##   Current 0.37283206  0.35128730 0.27588064
##   Left    0.23266114  0.70519425 0.06214461
## 
##          OnlineSecurity
## Y                 No No internet service        Yes
##   Current 0.39174941          0.27588064 0.33236995
##   Left    0.78900746          0.06214461 0.14884793
## 
##          MultipleLines
## Y                 No No phone service        Yes
##   Current 0.49474387       0.09695404 0.40830209
##   Left    0.45953484       0.08960066 0.45086450
## 
##          PhoneService
## Y                 No        Yes
##   Current 0.09695430 0.90304570
##   Left    0.08960131 0.91039869
## 
##          PaperlessBilling
## Y                No       Yes
##   Current 0.4611142 0.5388858
##   Left    0.2608416 0.7391584
## 
##          PaymentMethod
## Y         Bank transfer (automatic) Credit card (automatic) Electronic check
##   Current                 0.2451393               0.2490804        0.2482922
##   Left                    0.1315063               0.1170559        0.5895856
##          PaymentMethod
## Y         Mailed check
##   Current    0.2574881
##   Left       0.1618523
## 
##          SeniorCitizen
## Y                 0         1
##   Current 0.8707285 0.1292715
##   Left    0.7377133 0.2622867
## 
##          StreamingMovies
## Y                 No No internet service        Yes
##   Current 0.34997359          0.27588064 0.37414576
##   Left    0.50433155          0.06214461 0.43352384
## 
##          Partner
## Y                No       Yes
##   Current 0.4753023 0.5246977
##   Left    0.6473967 0.3526033
# Predict probabilities for "Left"
probs_r <- predict(model_r, newdata = test_data, type = "raw")[, "Left"]

# Compute ROC object
roc_r <- roc(response = test_data$Status,
             predictor = probs_r,
             levels = c("Current", "Left"),
             direction = "<")

# Plot the ROC curve
plot(roc_r, main = "Naive Bayes – Model N2 ROC Curve", col = "darkgreen", lwd = 3)

# Show AUC value
auc(roc_r)
## Area under the curve: 0.8316
# Step 1: Get posterior probabilities from Model N2
probs_r_all <- predict(model_r, newdata = test_data, type = "raw")
probs_r_left <- probs_r_all[, "Left"]  # probability of customer leaving
# Threshold = 0.5
pred_r_th0.5 <- ifelse(probs_r_left > 0.5, "Left", "Current")
cat("Confusion Matrix – Threshold 0.5:\n")
## Confusion Matrix – Threshold 0.5:
print(table(pred_r_th0.5, test_data$Status))
##             
## pred_r_th0.5 Current Left
##      Current     736   87
##      Left        215  260
cat("Prediction Accuracy (0.5):", mean(pred_r_th0.5 == test_data$Status), "\n\n")
## Prediction Accuracy (0.5): 0.7673344
# Threshold = 0.7
pred_r_th0.7 <- ifelse(probs_r_left > 0.7, "Left", "Current")
cat("Confusion Matrix – Threshold 0.7:\n")
## Confusion Matrix – Threshold 0.7:
print(table(pred_r_th0.7, test_data$Status))
##             
## pred_r_th0.7 Current Left
##      Current     804  130
##      Left        147  217
cat("Prediction Accuracy (0.7):", mean(pred_r_th0.7 == test_data$Status), "\n\n")
## Prediction Accuracy (0.7): 0.7865948
# Threshold = 0.8
pred_r_th0.8 <- ifelse(probs_r_left > 0.8, "Left", "Current")
cat("Confusion Matrix – Threshold 0.8:\n")
## Confusion Matrix – Threshold 0.8:
print(table(pred_r_th0.8, test_data$Status))
##             
## pred_r_th0.8 Current Left
##      Current     837  158
##      Left        114  189
cat("Prediction Accuracy (0.8):", mean(pred_r_th0.8 == test_data$Status), "\n\n")
## Prediction Accuracy (0.8): 0.7904468
# Create data frame with prediction accuracy at each threshold
nb_threshold_results <- data.frame(
  Threshold = c(0.5, 0.7, 0.8),
  Accuracy = c(0.7673344, 0.7865948, 0.7904468)
)

# Print the summary
print(nb_threshold_results)
##   Threshold  Accuracy
## 1       0.5 0.7673344
## 2       0.7 0.7865948
## 3       0.8 0.7904468
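The three cutoffs above were evaluated one chunk at a time; the same sweep can be written once with `sapply()`. A self-contained sketch, with simulated probabilities and labels standing in for `probs_r_left` and `test_data$Status`:

```r
set.seed(1)
# Simulated stand-ins for the posterior probabilities and the true labels
probs  <- runif(500)
status <- ifelse(probs + rnorm(500, sd = 0.3) > 0.6, "Left", "Current")

# Accuracy at every cutoff in one pass
thresholds <- c(0.5, 0.7, 0.8)
accuracy <- sapply(thresholds, function(th) {
  preds <- ifelse(probs > th, "Left", "Current")
  mean(preds == status)
})
data.frame(Threshold = thresholds, Accuracy = accuracy)
```

The same pattern extends to a fine grid (e.g. `seq(0.05, 0.95, by = 0.05)`) if one wants the accuracy-maximizing threshold rather than a few hand-picked values.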

4. Linear Discriminant Analysis

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Build the full LDA model
lda_model_full <- lda(Status ~ SeniorCitizen + Partner + Dependents + Tenure + 
                      PhoneService + MultipleLines + InternetService + 
                      OnlineSecurity + DeviceProtection + Contract + 
                      PaperlessBilling + PaymentMethod + MonthlyCharges + 
                      TotalCharges + Gender,
                      data = train_data)
## Warning in lda.default(x, grouping, ...): variables are collinear
# Predict on test data
lda_pred_full <- predict(lda_model_full, newdata = test_data)

# Extract probabilities for class "Left"
probs_full <- lda_pred_full$posterior[, "Left"]

# Generate ROC curve
roc_full <- roc(response = test_data$Status,
                predictor = probs_full,
                levels = c("Current", "Left"),
                direction = "<")

# Plot ROC curve
plot(roc_full, main = "ROC – Full LDA Model", col = "purple", lwd = 2)

# Show AUC
auc(roc_full)
## Area under the curve: 0.8381
# This will select only numeric columns from train_data
numeric_vars <- train_data %>%
  select_if(is.numeric)

# View names of numeric variables
names(numeric_vars)
## [1] "Tenure"         "MonthlyCharges" "TotalCharges"
# Compute correlation matrix
cor_matrix <- cor(numeric_vars)

# View correlation matrix
round(cor_matrix, 2)
##                Tenure MonthlyCharges TotalCharges
## Tenure           1.00           0.24         0.82
## MonthlyCharges   0.24           1.00         0.65
## TotalCharges     0.82           0.65         1.00
table(train_data$PhoneService, train_data$MultipleLines)
##      
##         No No phone service  Yes
##   No     0              493    0
##   Yes 2519                0 2178
table(train_data$InternetService, train_data$OnlineSecurity)
##              
##                 No No internet service  Yes
##   DSL          892                   0  849
##   Fiber optic 1691                   0  622
##   No             0                1136    0

- There is perfect multicollinearity between the service-related variables (e.g., MultipleLines and PhoneService, OnlineSecurity and InternetService): the "No phone service" and "No internet service" levels are fully determined by the parent variable, which explains the collinearity warning from lda().
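The contingency tables above show the structural dependence directly: whenever PhoneService is "No", MultipleLines is always "No phone service", so their dummy columns are linearly dependent. A toy sketch (a hypothetical mini data frame, not the project data) that exposes the resulting rank deficiency:

```r
# Toy data reproducing the structural dependence seen in the tables
toy <- data.frame(
  PhoneService  = c("No", "Yes", "Yes", "No", "Yes"),
  MultipleLines = c("No phone service", "Yes", "No", "No phone service", "Yes")
)

# Dummy coding of both factors (intercept + 3 dummy columns)
X <- model.matrix(~ PhoneService + MultipleLines, data = toy)

# Rank is less than the number of columns: one column is redundant,
# because MultipleLines"No phone service" = Intercept - PhoneServiceYes
c(columns = ncol(X), rank = qr(X)$rank)
```

This is the situation `lda()` flags as "variables are collinear", and it motivates dropping one variable from each redundant pair in the reduced models below.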

# Model R2: Keep MonthlyCharges + TotalCharges, remove Tenure
lda_model_r2 <- lda(Status ~ SeniorCitizen + Partner + Dependents +
                    PhoneService + InternetService + Contract +
                    PaperlessBilling + PaymentMethod + MonthlyCharges + 
                    TotalCharges + Gender,
                    data = train_data)

lda_pred_r2 <- predict(lda_model_r2, newdata = test_data)
probs_r2 <- lda_pred_r2$posterior[, "Left"]  # distinct name, avoids clobbering the NB probs_r

roc_r2 <- roc(response = test_data$Status,
              predictor = probs_r2,
              levels = c("Current", "Left"),
              direction = "<")

plot(roc_r2, main = "ROC – LDA Model R2", col = "darkgreen", lwd = 2)

auc(roc_r2)
## Area under the curve: 0.8402
# Model R3: Keep Tenure + TotalCharges, remove MonthlyCharges
lda_model_r3 <- lda(Status ~ SeniorCitizen + Partner + Dependents +
                    PhoneService + InternetService + Contract +
                    PaperlessBilling + PaymentMethod + Tenure +
                    TotalCharges + Gender,
                    data = train_data)

lda_pred_r3 <- predict(lda_model_r3, newdata = test_data)
probs_r3 <- lda_pred_r3$posterior[, "Left"]

roc_r3 <- roc(response = test_data$Status,
              predictor = probs_r3,
              levels = c("Current", "Left"),
              direction = "<")

plot(roc_r3, main = "ROC – LDA Model R3", col = "orange", lwd = 2)

auc(roc_r3)
## Area under the curve: 0.839
# Model R4: Keep Tenure, MonthlyCharges, and TotalCharges; drop Dependents
lda_model_r4 <- lda(Status ~ SeniorCitizen + Partner +
                    PhoneService + InternetService + Contract +
                    PaperlessBilling + PaymentMethod + Tenure +
                    MonthlyCharges + TotalCharges + Gender,
                    data = train_data)

lda_pred_r4 <- predict(lda_model_r4, newdata = test_data)
probs_r4 <- lda_pred_r4$posterior[, "Left"]

roc_r4 <- roc(response = test_data$Status,
              predictor = probs_r4,
              levels = c("Current", "Left"),
              direction = "<")

plot(roc_r4, main = "ROC – LDA Model R4", col = "red", lwd = 2)

auc(roc_r4)
## Area under the curve: 0.8405
lda_model_r4 
## Call:
## lda(Status ~ SeniorCitizen + Partner + PhoneService + InternetService + 
##     Contract + PaperlessBilling + PaymentMethod + Tenure + MonthlyCharges + 
##     TotalCharges + Gender, data = train_data)
## 
## Prior probabilities of groups:
##   Current      Left 
## 0.7333333 0.2666667 
## 
## Group means:
##         SeniorCitizen1 PartnerYes PhoneServiceYes InternetServiceFiber optic
## Current      0.1292696  0.5246978       0.9030478                  0.3512874
## Left         0.2622832  0.3526012       0.9104046                  0.7052023
##         InternetServiceNo ContractOne year ContractTwo year PaperlessBillingYes
## Current        0.27588019       0.24855491       0.31818182           0.5388860
## Left           0.06213873       0.08381503       0.02817919           0.7391618
##         PaymentMethodCredit card (automatic) PaymentMethodElectronic check
## Current                            0.2490804                     0.2482922
## Left                               0.1170520                     0.5895954
##         PaymentMethodMailed check   Tenure MonthlyCharges TotalCharges
## Current                 0.2574882 37.46926       61.26146     2533.272
## Left                    0.1618497 17.80347       74.55751     1528.829
##         GenderMale
## Current  0.5073568
## Left     0.4927746
## 
## Coefficients of linear discriminants:
##                                                LD1
## SeniorCitizen1                        0.3207039478
## PartnerYes                           -0.0558106716
## PhoneServiceYes                      -0.4188408378
## InternetServiceFiber optic            0.8382195042
## InternetServiceNo                    -0.0758157927
## ContractOne year                     -0.5973819289
## ContractTwo year                     -0.4424887962
## PaperlessBillingYes                   0.2288347463
## PaymentMethodCredit card (automatic) -0.0477659464
## PaymentMethodElectronic check         0.4962240694
## PaymentMethodMailed check            -0.0716122280
## Tenure                               -0.0100363576
## MonthlyCharges                        0.0117782650
## TotalCharges                         -0.0002257877
## GenderMale                           -0.0100234500
# Step 1: Get posterior probabilities from Model R4
lda_pred_r4 <- predict(lda_model_r4, newdata = test_data)
probs_r4_left <- lda_pred_r4$posterior[, "Left"]  # Probabilities for "Left" class
# --- Threshold = 0.4 ---
pred_r4_th0.4 <- ifelse(probs_r4_left > 0.4, "Left", "Current")
cat("Confusion Matrix – Threshold 0.4:\n")
## Confusion Matrix – Threshold 0.4:
print(table(pred_r4_th0.4, test_data$Status))
##              
## pred_r4_th0.4 Current Left
##       Current     825  135
##       Left        126  212
cat("Prediction Accuracy (0.4):", mean(pred_r4_th0.4 == test_data$Status), "\n\n")
## Prediction Accuracy (0.4): 0.7989214
# --- Threshold = 0.5 ---
pred_r4_th0.5 <- ifelse(probs_r4_left > 0.5, "Left", "Current")
cat("Confusion Matrix – Threshold 0.5:\n")
## Confusion Matrix – Threshold 0.5:
print(table(pred_r4_th0.5, test_data$Status))
##              
## pred_r4_th0.5 Current Left
##       Current     862  173
##       Left         89  174
cat("Prediction Accuracy (0.5):", mean(pred_r4_th0.5 == test_data$Status), "\n\n")
## Prediction Accuracy (0.5): 0.798151
# --- Threshold = 0.6 ---
pred_r4_th0.6 <- ifelse(probs_r4_left > 0.6, "Left", "Current")
cat("Confusion Matrix – Threshold 0.6:\n")
## Confusion Matrix – Threshold 0.6:
print(table(pred_r4_th0.6, test_data$Status))
##              
## pred_r4_th0.6 Current Left
##       Current     892  208
##       Left         59  139
cat("Prediction Accuracy (0.6):", mean(pred_r4_th0.6 == test_data$Status), "\n\n")
## Prediction Accuracy (0.6): 0.7942989
# Create a data frame to summarize thresholds and accuracy
lda_threshold_results <- data.frame(
  Threshold = c(0.4, 0.5, 0.6),
  Accuracy = c(0.7989214, 0.798151, 0.7942989)
)

# View the summary table
print(lda_threshold_results)
##   Threshold  Accuracy
## 1       0.4 0.7989214
## 2       0.5 0.7981510
## 3       0.6 0.7942989

5. Quadratic Discriminant Analysis

# Model Q1
qda_q1 <- qda(Status ~ SeniorCitizen + Partner + Dependents + PhoneService +
              InternetService + Contract + PaperlessBilling +
              PaymentMethod + MonthlyCharges + Gender,
              data = train_data)
# Predict probabilities from Q1
qda_pred_q1 <- predict(qda_q1, newdata = test_data)
qda_probs_q1 <- qda_pred_q1$posterior[, "Left"]

# ROC & AUC for Q1
roc_q1 <- roc(response = test_data$Status,
              predictor = qda_probs_q1,
              levels = c("Current", "Left"),
              direction = "<")

# Plot ROC curve
plot(roc_q1, main = "ROC Curve – QDA Model Q1", col = "darkblue", lwd = 2)

# Print AUC
auc(roc_q1)
## Area under the curve: 0.8266
# Model Q2
qda_q2 <- qda(Status ~ SeniorCitizen + Partner + Dependents + PhoneService +
              InternetService + Contract + PaperlessBilling +
              PaymentMethod + Tenure + Gender,
              data = train_data)
# Predict probabilities from Q2
qda_pred_q2 <- predict(qda_q2, newdata = test_data)
qda_probs_q2 <- qda_pred_q2$posterior[, "Left"]

# ROC & AUC for Q2
roc_q2 <- roc(response = test_data$Status,
              predictor = qda_probs_q2,
              levels = c("Current", "Left"),
              direction = "<")

# Plot ROC curve
plot(roc_q2, main = "ROC Curve – QDA Model Q2", col = "darkgreen", lwd = 2)

# Print AUC
auc(roc_q2)
## Area under the curve: 0.8391
# Model Q3
qda_q3 <- qda(Status ~ SeniorCitizen + Partner + Dependents + PhoneService +
         InternetService + Contract + PaperlessBilling +
         PaymentMethod + MonthlyCharges + Tenure + Gender,data = train_data)
# Predict probabilities from Q3
qda_pred_q3 <- predict(qda_q3, newdata = test_data)

# Extract probability for class "Left"
qda_probs_q3 <- qda_pred_q3$posterior[, "Left"]

# Compute ROC and AUC
library(pROC)
roc_q3 <- roc(response = test_data$Status,
              predictor = qda_probs_q3,
              levels = c("Current", "Left"),
              direction = "<")

# Plot ROC curve
plot(roc_q3, main = "ROC Curve – QDA Model Q3", col = "purple", lwd = 2)

# Show AUC value
auc(roc_q3)
## Area under the curve: 0.8407
qda_q3
## Call:
## qda(Status ~ SeniorCitizen + Partner + Dependents + PhoneService + 
##     InternetService + Contract + PaperlessBilling + PaymentMethod + 
##     MonthlyCharges + Tenure + Gender, data = train_data)
## 
## Prior probabilities of groups:
##   Current      Left 
## 0.7333333 0.2666667 
## 
## Group means:
##         SeniorCitizen1 PartnerYes DependentsYes PhoneServiceYes
## Current      0.1292696  0.5246978     0.3436679       0.9030478
## Left         0.2622832  0.3526012     0.1654624       0.9104046
##         InternetServiceFiber optic InternetServiceNo ContractOne year
## Current                  0.3512874        0.27588019       0.24855491
## Left                     0.7052023        0.06213873       0.08381503
##         ContractTwo year PaperlessBillingYes
## Current       0.31818182           0.5388860
## Left          0.02817919           0.7391618
##         PaymentMethodCredit card (automatic) PaymentMethodElectronic check
## Current                            0.2490804                     0.2482922
## Left                               0.1170520                     0.5895954
##         PaymentMethodMailed check MonthlyCharges   Tenure GenderMale
## Current                 0.2574882       61.26146 37.46926  0.5073568
## Left                    0.1618497       74.55751 17.80347  0.4927746
# Get posterior probabilities for "Left"
qda_pred_q3 <- predict(qda_q3, newdata = test_data)
qda_probs_q3_left <- qda_pred_q3$posterior[, "Left"]
# Threshold = 0.4
pred_q3_th0.4 <- ifelse(qda_probs_q3_left > 0.4, "Left", "Current")
cat("Confusion Matrix – Threshold 0.4:\n")
## Confusion Matrix – Threshold 0.4:
print(table(pred_q3_th0.4, test_data$Status))
##              
## pred_q3_th0.4 Current Left
##       Current     716   83
##       Left        235  264
cat("Prediction Accuracy (0.4):", round(mean(pred_q3_th0.4 == test_data$Status), 4), "\n\n")
## Prediction Accuracy (0.4): 0.755
# Threshold = 0.5
pred_q3_th0.5 <- ifelse(qda_probs_q3_left > 0.5, "Left", "Current")
cat("Confusion Matrix – Threshold 0.5:\n")
## Confusion Matrix – Threshold 0.5:
print(table(pred_q3_th0.5, test_data$Status))
##              
## pred_q3_th0.5 Current Left
##       Current     739   92
##       Left        212  255
cat("Prediction Accuracy (0.5):", round(mean(pred_q3_th0.5 == test_data$Status), 4), "\n\n")
## Prediction Accuracy (0.5): 0.7658
# Threshold = 0.6
pred_q3_th0.6 <- ifelse(qda_probs_q3_left > 0.6, "Left", "Current")
cat("Confusion Matrix – Threshold 0.6:\n")
## Confusion Matrix – Threshold 0.6:
print(table(pred_q3_th0.6, test_data$Status))
##              
## pred_q3_th0.6 Current Left
##       Current     770  107
##       Left        181  240
cat("Prediction Accuracy (0.6):", round(mean(pred_q3_th0.6 == test_data$Status), 4), "\n\n")
## Prediction Accuracy (0.6): 0.7781
# Create data frame 
qda_q3_accuracy_df <- data.frame(
  Threshold = c(0.4, 0.5, 0.6),
  Accuracy = c(0.7550, 0.7658, 0.7781)
)

# View it
print(qda_q3_accuracy_df)
##   Threshold Accuracy
## 1       0.4   0.7550
## 2       0.5   0.7658
## 3       0.6   0.7781

6. Decision Trees

library(tree)  # For building decision trees
# Build decision tree model
dt_model <- tree(Status ~ SeniorCitizen + Partner + Dependents + Tenure +
                 PhoneService + MultipleLines + InternetService + OnlineSecurity +
                 DeviceProtection + Contract + PaperlessBilling + PaymentMethod +
                 MonthlyCharges + TotalCharges + Gender,
                 data = train_data)

# Summary of the model
summary(dt_model)
## 
## Classification tree:
## tree(formula = Status ~ SeniorCitizen + Partner + Dependents + 
##     Tenure + PhoneService + MultipleLines + InternetService + 
##     OnlineSecurity + DeviceProtection + Contract + PaperlessBilling + 
##     PaymentMethod + MonthlyCharges + TotalCharges + Gender, data = train_data)
## Variables actually used in tree construction:
## [1] "Contract"        "InternetService" "Tenure"         
## Number of terminal nodes:  6 
## Residual mean deviance:  0.8761 = 4542 / 5184 
## Misclassification error rate: 0.2094 = 1087 / 5190
# Plot the tree
plot(dt_model)
text(dt_model, pretty = 0)

# Predict class labels
dt_preds <- predict(dt_model, newdata = test_data, type = "class")

# Confusion matrix
confusion_matrix_dt <- table(dt_preds, test_data$Status)
print(confusion_matrix_dt)
##          
## dt_preds  Current Left
##   Current     903  229
##   Left         48  118
# Prediction accuracy
accuracy_dt <- mean(dt_preds == test_data$Status)
cat("Prediction Accuracy (Unpruned Tree):", accuracy_dt, "\n")
## Prediction Accuracy (Unpruned Tree): 0.7865948
# Step 1: Perform cross-validation to determine optimal size
cv.churn <- cv.tree(dt_model, FUN = prune.misclass)

# Step 2: Plot tree size vs misclassification error
plot(cv.churn$size, cv.churn$dev, type = "b",
     xlab = "Tree Size (Number of Terminal Nodes)",
     ylab = "Cross-Validation Error",
     main = "CV Error vs. Tree Size")

# Step 3: Identify optimal tree size
best_size <- cv.churn$size[which.min(cv.churn$dev)]
cat("Best tree size:", best_size, "\n")
## Best tree size: 6

We used cross-validation to determine the optimal complexity of the decision tree model. The CV Error vs. Tree Size plot indicated that a tree with 6 terminal nodes minimizes cross-validation error.
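Under the hood, `cv.tree()` uses 10-fold cross-validation: the observations are randomly partitioned into 10 folds, each fold is held out once while the model is evaluated on it, and the held-out errors are aggregated. A self-contained sketch of that loop, using a made-up toy cutoff classifier in place of the tree (the `x` and `y` here are simulated, not the churn data, and the rule is fixed rather than refit per fold):

```r
set.seed(123)
# Toy labels and a toy predictor
n <- 200
x <- runif(n)
y <- ifelse(x + rnorm(n, sd = 0.2) > 0.5, "Left", "Current")

k <- 10
folds <- sample(rep(1:k, length.out = n))  # random 10-fold assignment

# Held-out misclassification error of a simple cutoff rule, fold by fold
cv_err <- sapply(1:k, function(f) {
  held_out <- folds == f
  preds <- ifelse(x[held_out] > 0.5, "Left", "Current")
  mean(preds != y[held_out])
})
mean(cv_err)  # the cross-validation error estimate
```

`cv.tree()` repeats this for each candidate tree size, which is what produces the CV Error vs. Tree Size curve plotted above.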

# Prune the tree to the best size (6 terminal nodes)
pruned_churn_tree <- prune.tree(dt_model, best = 6)

# Plot the pruned tree
plot(pruned_churn_tree)
text(pruned_churn_tree, pretty = 0)

# Make predictions on test set
pruned_preds <- predict(pruned_churn_tree, newdata = test_data, type = "class")

# Create confusion matrix
conf_matrix_pruned <- table(Predicted = pruned_preds, Actual = test_data$Status)
print(conf_matrix_pruned)
##          Actual
## Predicted Current Left
##   Current     903  229
##   Left         48  118
# Calculate prediction accuracy
accuracy_pruned <- mean(pruned_preds == test_data$Status)
cat("Prediction Accuracy (Pruned Tree - Size 6):", round(accuracy_pruned, 4), "\n")
## Prediction Accuracy (Pruned Tree - Size 6): 0.7866
# Load required library
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
# Set seed for reproducibility
set.seed(123)

# ------------------------- #
# 1. Bagging Model (mtry = total predictors)
# ------------------------- #

bag_model <- randomForest(Status ~ ., 
                          data = train_data, 
                          mtry = ncol(train_data) - 1,  # all predictors
                          importance = TRUE, 
                          ntree = 500)

# Predict on test set
bag_preds <- predict(bag_model, newdata = test_data)

# Confusion Matrix and Accuracy
cat("Confusion Matrix – Bagging:\n")
## Confusion Matrix – Bagging:
print(table(Predicted = bag_preds, Actual = test_data$Status))
##          Actual
## Predicted Current Left
##   Current     844  186
##   Left        107  161
bag_accuracy <- mean(bag_preds == test_data$Status)
cat("Prediction Accuracy (Bagging):", round(bag_accuracy, 4), "\n\n")
## Prediction Accuracy (Bagging): 0.7743
# Variable Importance Plot
varImpPlot(bag_model, main = "Variable Importance – Bagging")

# ------------------------- #
# 2. Random Forest Model (mtry = √p)
# ------------------------- #

rf_model <- randomForest(Status ~ ., 
                         data = train_data, 
                         mtry = floor(sqrt(ncol(train_data) - 1)), 
                         importance = TRUE, 
                         ntree = 500)

# Predict on test data
rf_preds <- predict(rf_model, newdata = test_data)

# Confusion Matrix and Accuracy
cat("Confusion Matrix – Random Forest:\n")
## Confusion Matrix – Random Forest:
print(table(Predicted = rf_preds, Actual = test_data$Status))
##          Actual
## Predicted Current Left
##   Current     858  173
##   Left         93  174
rf_accuracy <- mean(rf_preds == test_data$Status)
cat("Prediction Accuracy (Random Forest):", round(rf_accuracy, 4), "\n\n")
## Prediction Accuracy (Random Forest): 0.7951
# Variable Importance Plot
varImpPlot(rf_model, main = "Variable Importance – Random Forest")

Bagging – Variable Importance

Random Forest – Variable Importance

Let's try to get the best accuracy using the most important predictors.

# Load necessary library
library(tree)

# --------------------------
# Step 1: Build Model D1
# --------------------------
model_d1 <- tree(Status ~ TotalCharges + Contract + Tenure + MonthlyCharges,
                 data = train_data)

# Summary
summary(model_d1)
## 
## Classification tree:
## tree(formula = Status ~ TotalCharges + Contract + Tenure + MonthlyCharges, 
##     data = train_data)
## Variables actually used in tree construction:
## [1] "Contract"       "MonthlyCharges" "Tenure"        
## Number of terminal nodes:  6 
## Residual mean deviance:  0.886 = 4593 / 5184 
## Misclassification error rate: 0.2139 = 1110 / 5190
# Plot the tree
plot(model_d1)
text(model_d1, pretty = 0)

# --------------------------
# Step 2: Predict & Accuracy
# --------------------------
pred_d1 <- predict(model_d1, newdata = test_data, type = "class")

# Confusion Matrix
conf_matrix_d1 <- table(Predicted = pred_d1, Actual = test_data$Status)
print(conf_matrix_d1)
##          Actual
## Predicted Current Left
##   Current     902  234
##   Left         49  113
# Accuracy
accuracy_d1 <- mean(pred_d1 == test_data$Status)
cat("Prediction Accuracy (Unpruned Tree – Model D1):", round(accuracy_d1, 4), "\n")
## Prediction Accuracy (Unpruned Tree – Model D1): 0.782
# --------------------------
# Step 3: Cross-Validation
# --------------------------
set.seed(123)
cv_d1 <- cv.tree(model_d1, FUN = prune.misclass)

# Plot CV Error vs Tree Size
plot(cv_d1$size, cv_d1$dev, type = "b",
     xlab = "Tree Size (Terminal Nodes)",
     ylab = "Cross-Validation Error",
     main = "CV Error vs Tree Size – Model D1")

# Get best size
best_size_d1 <- cv_d1$size[which.min(cv_d1$dev)]
cat("Best tree size for Model D1:", best_size_d1, "\n")
## Best tree size for Model D1: 6
# --------------------------
# Step 4: Prune the tree
# --------------------------
pruned_d1 <- prune.tree(model_d1, best = best_size_d1)

# Plot pruned tree
plot(pruned_d1)
text(pruned_d1, pretty = 0)

# Predict using pruned tree
pruned_d1_preds <- predict(pruned_d1, newdata = test_data, type = "class")

# Confusion Matrix
conf_matrix_pruned_d1 <- table(Predicted = pruned_d1_preds, Actual = test_data$Status)
print(conf_matrix_pruned_d1)
##          Actual
## Predicted Current Left
##   Current     902  234
##   Left         49  113
# Accuracy
accuracy_pruned_d1 <- mean(pruned_d1_preds == test_data$Status)
cat("Prediction Accuracy (Pruned Tree – Model D1):", round(accuracy_pruned_d1, 4), "\n")
## Prediction Accuracy (Pruned Tree – Model D1): 0.782
# Load necessary library
library(tree)

# --------------------------
# Step 1: Build Model D2
# --------------------------
model_d2 <- tree(Status ~ TotalCharges + Contract + Tenure + MonthlyCharges +
                             InternetService + OnlineSecurity,
                 data = train_data)

# Summary
summary(model_d2)
## 
## Classification tree:
## tree(formula = Status ~ TotalCharges + Contract + Tenure + MonthlyCharges + 
##     InternetService + OnlineSecurity, data = train_data)
## Variables actually used in tree construction:
## [1] "Contract"        "InternetService" "Tenure"         
## Number of terminal nodes:  6 
## Residual mean deviance:  0.8761 = 4542 / 5184 
## Misclassification error rate: 0.2094 = 1087 / 5190
# Plot the tree
plot(model_d2)
text(model_d2, pretty = 0)

# --------------------------
# Step 2: Predict & Accuracy
# --------------------------
pred_d2 <- predict(model_d2, newdata = test_data, type = "class")

# Confusion Matrix
conf_matrix_d2 <- table(Predicted = pred_d2, Actual = test_data$Status)
print(conf_matrix_d2)
##          Actual
## Predicted Current Left
##   Current     903  229
##   Left         48  118
# Accuracy
acc_d2 <- mean(pred_d2 == test_data$Status)
cat("Prediction Accuracy (Unpruned Tree – Model D2):", round(acc_d2, 4), "\n")
## Prediction Accuracy (Unpruned Tree – Model D2): 0.7866
# --------------------------
# Step 3: Cross-Validation
# --------------------------
set.seed(123)
cv_d2 <- cv.tree(model_d2, FUN = prune.misclass)

# Plot CV Error vs Tree Size
plot(cv_d2$size, cv_d2$dev, type = "b",
     xlab = "Tree Size (Terminal Nodes)",
     ylab = "Cross-Validation Error",
     main = "CV Error vs Tree Size – Model D2")

# Get best size
best_size_d2 <- cv_d2$size[which.min(cv_d2$dev)]
cat("Best tree size for Model D2:", best_size_d2, "\n")
## Best tree size for Model D2: 6
# --------------------------
# Step 4: Prune the Tree
# --------------------------
pruned_d2 <- prune.tree(model_d2, best = best_size_d2)

# Plot pruned tree
plot(pruned_d2)
text(pruned_d2, pretty = 0)

# Predict using pruned tree
pruned_d2_preds <- predict(pruned_d2, newdata = test_data, type = "class")

# Confusion Matrix
conf_matrix_pruned_d2 <- table(Predicted = pruned_d2_preds, Actual = test_data$Status)
print(conf_matrix_pruned_d2)
##          Actual
## Predicted Current Left
##   Current     903  229
##   Left         48  118
# Accuracy
accuracy_pruned_d2 <- mean(pruned_d2_preds == test_data$Status)
cat("Prediction Accuracy (Pruned Tree – Model D2):", round(accuracy_pruned_d2, 4), "\n")
## Prediction Accuracy (Pruned Tree – Model D2): 0.7866
# Create a dataframe with prediction accuracies for different Decision Tree models
dt_results <- data.frame(
  Model = c("Full Model (Unpruned)", 
            "Full Model (Pruned)",
            "Model D1 (Unpruned)", 
            "Model D1 (Pruned)", 
            "Model D2 (Unpruned)", 
            "Model D2 (Pruned)"),
  Accuracy = c(0.7866, 0.7866, 0.782, 0.782, 0.7866, 0.7866)
)

# View the results
print(dt_results)
##                   Model Accuracy
## 1 Full Model (Unpruned)   0.7866
## 2   Full Model (Pruned)   0.7866
## 3   Model D1 (Unpruned)   0.7820
## 4     Model D1 (Pruned)   0.7820
## 5   Model D2 (Unpruned)   0.7866
## 6     Model D2 (Pruned)   0.7866
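For a quick visual comparison, the accuracies collected in `dt_results` can be charted (a minimal sketch using ggplot2, which is already loaded via the tidyverse):

```r
# Bar chart of decision tree accuracies, sorted from lowest to highest
ggplot(dt_results, aes(x = reorder(Model, Accuracy), y = Accuracy)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Decision Tree Model Accuracies on the Test Set",
       x = NULL, y = "Prediction Accuracy")
```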

7. Comparison across Methods

Compare across the methods used above (skipping the decision tree model) and report your best method based on ROC plots.

# Create AUC comparison dataframe
auc_comparison <- data.frame(
  Method = c("Logistic Regression (Model 2)", 
             "Naive Bayes (Model N2)", 
             "LDA (Model R4)", 
             "QDA (Model Q3)"),
  AUC = c(0.8465, 0.8316, 0.8405, 0.8407)
)

# View the dataframe
print(auc_comparison)
##                          Method    AUC
## 1 Logistic Regression (Model 2) 0.8465
## 2        Naive Bayes (Model N2) 0.8316
## 3                LDA (Model R4) 0.8405
## 4                QDA (Model Q3) 0.8407
  • Based on AUC, Logistic Regression (Model 2) is the best method (AUC = 0.8465), followed closely by QDA (0.8407) and LDA (0.8405); Naive Bayes trails at 0.8316.
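The ROC curves behind these AUC values can also be overlaid on one plot with the pROC package (a sketch only: the probability vector names `log_probs`, `nb_probs`, `lda_probs`, and `qda_probs` are assumptions standing in for whatever objects were saved when each model was evaluated on the test set earlier):

```r
library(pROC)

# Overlay one ROC curve per method; add = TRUE draws onto the existing plot
roc_log <- roc(test_data$Status, log_probs)
plot(roc_log, col = "blue", main = "ROC Curves by Method")
plot(roc(test_data$Status, nb_probs),  col = "red",       add = TRUE)
plot(roc(test_data$Status, lda_probs), col = "darkgreen", add = TRUE)
plot(roc(test_data$Status, qda_probs), col = "purple",    add = TRUE)
legend("bottomright",
       legend = c("Logistic", "Naive Bayes", "LDA", "QDA"),
       col = c("blue", "red", "darkgreen", "purple"), lwd = 2)
```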

Which model does the best in terms of the prediction accuracy on the test set? Include the decision tree model here.

# Create prediction accuracy dataframe
prediction_accuracy <- data.frame(
  Method = c("Logistic Regression (Model 2)", 
             "Naive Bayes (Model N2)", 
             "LDA (Model R4)", 
             "QDA (Model Q3)",
             "Decision Tree (Model D2: pruned)"),
  Pred_accuracy = c(0.8028, 0.7904, 0.7989, 0.7781, 0.7866)
)

# View the dataframe
print(prediction_accuracy)
##                             Method Pred_accuracy
## 1    Logistic Regression (Model 2)        0.8028
## 2           Naive Bayes (Model N2)        0.7904
## 3                   LDA (Model R4)        0.7989
## 4                   QDA (Model Q3)        0.7781
## 5 Decision Tree (Model D2: pruned)        0.7866
  • Logistic Regression (Model 2) again performs best, with a test-set prediction accuracy of 0.8028; the pruned decision tree (0.7866) is competitive but does not surpass it.

8. Business Analysis and Recommendations

In this section, we will use the Implementation_Data file to generate business recommendations. This data file has information on 500 current customers. Use your logistic regression model with your best threshold (identified in part 2) to answer the following questions:

In terms of relative importance, how would you rate the predictors in your model?

# Load the dataset
load("Implementation_Data.rda")
str(Implementation_Data)
## 'data.frame':    500 obs. of  16 variables:
##  $ Gender          : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 2 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 1 1 1 0 0 1 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 1 2 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 2 1 1 1 ...
##  $ Tenure          : int  45 64 26 23 26 2 67 71 2 72 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 1 3 1 1 1 3 1 3 1 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 3 2 1 3 3 3 1 2 1 3 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 2 3 1 2 2 2 3 3 1 2 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 2 1 3 2 2 2 1 1 1 2 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 2 3 3 2 2 2 1 1 1 2 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 3 3 2 2 2 1 3 2 1 3 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 2 2 2 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 1 4 3 2 4 1 2 2 2 ...
##  $ MonthlyCharges  : num  19.2 110.3 60.7 20.8 19.3 ...
##  $ TotalCharges    : num  904 6997 1597 485 504 ...
head(Implementation_Data)
# Convert SeniorCitizen column to a factor in the implementation dataset
# This ensures consistency with the model which was trained using SeniorCitizen as a factor
Implementation_Data <- Implementation_Data %>%
  mutate(SeniorCitizen = as.factor(SeniorCitizen))
# Get summary of the logistic regression model (Model 2)
summary(log_model_reduced)
## 
## Call:
## glm(formula = Status ~ SeniorCitizen + Tenure + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + Contract + 
##     PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges, 
##     family = "binomial", data = train_data)
## 
## Coefficients: (2 not defined because of singularities)
##                                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           1.245e-01  2.644e-01   0.471 0.637643    
## SeniorCitizen1                        2.907e-01  9.522e-02   3.053 0.002265 ** 
## Tenure                               -6.932e-02  7.392e-03  -9.378  < 2e-16 ***
## PhoneServiceYes                      -8.410e-01  1.744e-01  -4.822 1.42e-06 ***
## MultipleLinesNo phone service                NA         NA      NA       NA    
## MultipleLinesYes                      2.521e-01  9.512e-02   2.650 0.008046 ** 
## InternetServiceFiber optic            8.072e-01  1.615e-01   4.997 5.81e-07 ***
## InternetServiceNo                    -5.271e-01  2.205e-01  -2.391 0.016811 *  
## OnlineSecurityNo internet service            NA         NA      NA       NA    
## OnlineSecurityYes                    -5.711e-01  1.007e-01  -5.669 1.44e-08 ***
## ContractOne year                     -6.808e-01  1.273e-01  -5.349 8.87e-08 ***
## ContractTwo year                     -1.349e+00  1.979e-01  -6.816 9.37e-12 ***
## PaperlessBillingYes                   2.849e-01  8.602e-02   3.312 0.000926 ***
## PaymentMethodCredit card (automatic) -1.305e-01  1.345e-01  -0.971 0.331685    
## PaymentMethodElectronic check         4.293e-01  1.104e-01   3.890 0.000100 ***
## PaymentMethodMailed check            -7.842e-02  1.347e-01  -0.582 0.560524    
## MonthlyCharges                        1.968e-03  4.904e-03   0.401 0.688137    
## TotalCharges                          4.154e-04  8.306e-05   5.001 5.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6019.5  on 5189  degrees of freedom
## Residual deviance: 4300.5  on 5174  degrees of freedom
## AIC: 4332.5
## 
## Number of Fisher Scoring iterations: 6
  • In terms of relative importance, the predictors with the strongest statistical significance (smallest p-values) in the model are Tenure, Contract, TotalCharges, OnlineSecurity, and InternetService. These variables are the most reliably associated with churn and should be prioritized for business action and strategy. Other significant contributors include SeniorCitizen, PhoneService, MultipleLines, PaperlessBilling, and PaymentMethod (Electronic check).
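One way to make the coefficients easier to compare is to convert them to odds ratios: exp(coefficient) gives the multiplicative change in the odds of leaving for a one-unit increase (or a switch of factor level relative to the baseline). A minimal sketch, using the fitted model above (the two aliased coefficients reported as NA are dropped by sort):

```r
# Odds ratios for the reduced logistic regression model
odds_ratios <- exp(coef(log_model_reduced))
sort(odds_ratios, decreasing = TRUE)
# For example, exp(-1.349) ≈ 0.26 for ContractTwo year: two-year customers
# have roughly a quarter of the churn odds of month-to-month customers.
```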

How many customers does your model predict are in danger of leaving?

# Step 1: Make probability predictions using your best logistic model (Model 2)
impl_probs <- predict(log_model_reduced, newdata = Implementation_Data, type = "response")

# Step 2: Apply the best threshold (0.5) to classify
impl_predictions <- ifelse(impl_probs > 0.5, "Left", "Current")

# Step 3: Count how many customers are predicted to leave
table(impl_predictions)
## impl_predictions
## Current    Left 
##     389     111
num_at_risk <- sum(impl_predictions == "Left")
cat("Number of customers predicted to leave:", num_at_risk, "\n")
## Number of customers predicted to leave: 111
  • Using our best-performing logistic regression model (Model 2) with a probability threshold of 0.5, 111 of the 500 current customers are predicted to be at risk of leaving the company. This represents 22.2% of the current customer base in the implementation dataset.

What is the predicted loss in revenue per month if all these customers leave? This reflects the loss if no action is taken.

# Step 1: Predict probabilities using Logistic Regression Model 2
log_probs_impl <- predict(log_model_reduced, newdata = Implementation_Data, type = "response")

# Step 2: Assign predicted classes using threshold = 0.5
predicted_classes <- ifelse(log_probs_impl > 0.5, "Left", "Current")

# Step 3: Filter customers predicted to leave
customers_to_leave <- Implementation_Data[predicted_classes == "Left", ]

# Step 4: Calculate total predicted monthly revenue loss
total_revenue_loss <- sum(customers_to_leave$MonthlyCharges)

# Step 5: Print result
cat("Total predicted monthly revenue loss if all predicted-to-leave customers leave: $", round(total_revenue_loss, 2), "\n")
## Total predicted monthly revenue loss if all predicted-to-leave customers leave: $ 8883.75
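To see where this at-risk revenue is concentrated, the predicted leavers can be profiled by segment (a sketch reusing the `customers_to_leave` frame from the chunk above):

```r
# At-risk customers by contract type, most common first
customers_to_leave %>%
  count(Contract, sort = TRUE)

# Monthly revenue at risk by internet service type
customers_to_leave %>%
  group_by(InternetService) %>%
  summarise(n = n(),
            monthly_revenue = sum(MonthlyCharges)) %>%
  arrange(desc(monthly_revenue))
```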
As a business manager, which factors would you focus on (for example you could invest in offering some incentives or promotions) to decrease the chances of customers leaving?

To answer this, we’ll look at the significant predictors from our logistic regression model (Model 2), especially those that are:

  • Actionable (can be influenced by the business)
  • Strongly associated with leaving, based on the model’s coefficients and p-values
Key Focus Areas for Management

Based on the logistic regression results and business insight, the following predictors should be prioritized for customer retention strategies:

1. Contract Type

Customers on Month-to-month contracts are far more likely to leave. Action: Offer discounts or perks to encourage switching to 1-year or 2-year contracts.

2. Online Security

Customers without online security services tend to leave more frequently. Action: Provide free trials or bundle online security features as add-ons to retain these customers.

3. Internet Service

Customers on Fiber optic internet are at markedly higher risk of leaving (a positive, highly significant coefficient), whereas customers with no internet service are actually less likely to leave. Action: Review the pricing and reliability of the fiber optic offering, and promote stable plans or upgraded service packages to fiber customers.

4. Payment Method

Customers paying via Electronic Checks are more likely to leave. Action: Encourage auto-pay or card payment methods by offering small incentives or bill credits.

5. Paperless Billing

Customers enrolled in paperless billing are more likely to leave (the PaperlessBillingYes coefficient is positive and significant). Action: Improve the digital billing experience, for example with clearer e-bills and proactive payment reminders, rather than simply pushing paperless adoption.
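These focus areas can be sanity-checked against the model-building data by comparing raw churn rates across segments (a sketch; it assumes `train_data` still carries the original `Status` factor with levels "Current"/"Left"):

```r
# Observed churn rate by contract type in the training data
train_data %>%
  group_by(Contract) %>%
  summarise(n = n(),
            churn_rate = mean(Status == "Left"))

# Observed churn rate by payment method
train_data %>%
  group_by(PaymentMethod) %>%
  summarise(churn_rate = mean(Status == "Left")) %>%
  arrange(desc(churn_rate))
```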

Propose an incentive scheme to your manager that can help reduce the loss in revenue by retaining some (or all) customers.

Incentive Scheme Proposal to Reduce Customer Loss - Based on our logistic regression model (Model 2), we identified 111 current customers who are at risk of leaving the company. These customers contribute approximately $8,883.75 in monthly revenue. If no action is taken, this revenue will be lost.

  • To mitigate this, we propose the following targeted incentive scheme:

Proposed Incentive

  • Offer a $20 incentive (e.g., bill credit, one-time discount, or premium add-on) to each at-risk customer to encourage retention. Options could include:

  • $20 off the next bill

  • Free one-month premium service or support

  • Loyalty rewards for contract renewal

Expected Outcome

Assuming a 50% retention rate, we expect to retain roughly half of the 111 at-risk customers, preserving about $4,441.88 of the $8,883.75 in monthly revenue at risk.

Provide justification by evaluating costs and benefits of your incentive scheme. Costs will be the dollar amount in incentives given (for example). Benefits will be the revenues from these customers if they stay with your company. Compute the net benefits from your incentive scheme. Make a case in your report to your upper management for implementing your scheme.

# Step 1: Load the implementation dataset 

# Step 2: Ensure variable types match training data
Implementation_Data <- Implementation_Data %>%
  mutate(
    SeniorCitizen = as.factor(SeniorCitizen),
    Tenure = as.numeric(Tenure),
    MonthlyCharges = as.numeric(MonthlyCharges),
    TotalCharges = as.numeric(TotalCharges)
  )

# Step 3: Predict probabilities using the final logistic regression model
impl_probs <- predict(log_model_reduced, newdata = Implementation_Data, type = "response")

# Step 4: Assign class labels based on best threshold (e.g., 0.5)
impl_predicted_class <- ifelse(impl_probs >= 0.5, "Left", "Current")

# Step 5: Create a dataframe with predictions
Implementation_Data$Predicted_Status <- impl_predicted_class
Implementation_Data$Prob_Left <- impl_probs

# Step 6: Filter customers predicted to leave
predicted_leavers <- Implementation_Data %>%
  filter(Predicted_Status == "Left")

# Step 7: Cost-benefit calculation
incentive_per_customer <- 20         # Proposed incentive in USD
estimated_retention_rate <- 0.5      # Assume 50% retention with incentives

# Total predicted monthly revenue from customers at risk
total_monthly_revenue <- sum(predicted_leavers$MonthlyCharges)

# Estimated retained revenue (50% of at-risk customers)
estimated_retained_revenue <- total_monthly_revenue * estimated_retention_rate

# Incentive cost = $20 * number of at-risk customers
total_incentive_cost <- nrow(predicted_leavers) * incentive_per_customer

# Net benefit
net_benefit <- estimated_retained_revenue - total_incentive_cost

# Output
cat("Number of At-Risk Customers:", nrow(predicted_leavers), "\n")
## Number of At-Risk Customers: 111
cat("Total Monthly Revenue at Risk: $", round(total_monthly_revenue, 2), "\n")
## Total Monthly Revenue at Risk: $ 8883.75
cat("Estimated Retained Revenue (50%): $", round(estimated_retained_revenue, 2), "\n")
## Estimated Retained Revenue (50%): $ 4441.88
cat("Total Incentive Cost: $", total_incentive_cost, "\n")
## Total Incentive Cost: $ 2220
cat("Net Benefit: $", round(net_benefit, 2), "\n")
## Net Benefit: $ 2221.88

Recommendation to Management - To mitigate the risk of losing approximately $8,883.75 in monthly revenue, we recommend implementing a retention incentive of $20 per customer for the 111 at-risk customers identified by our model.

  • By doing so:

  • We expect to retain approximately 50% of these customers,

  • Resulting in a retained revenue of $4,441.88,

  • At a cost of $2,220.00 in incentives,

  • Yielding a net monthly benefit of $2,221.88.

  • This plan is cost-effective, data-driven, and aligned with our retention goals. It balances short-term costs with long-term revenue preservation and should be approved for immediate implementation.
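Because the 50% retention rate is an assumption, a simple sensitivity check shows how the net benefit varies with it (a sketch reusing `total_monthly_revenue` and `total_incentive_cost` from the cost-benefit chunk above):

```r
# Net monthly benefit across a range of assumed retention rates
retention_rates <- seq(0.1, 0.9, by = 0.1)
sensitivity <- data.frame(
  RetentionRate   = retention_rates,
  RetainedRevenue = total_monthly_revenue * retention_rates,
  IncentiveCost   = total_incentive_cost,
  NetBenefit      = total_monthly_revenue * retention_rates - total_incentive_cost
)
print(sensitivity)
# Break-even occurs where retained revenue equals the incentive cost:
# 2220 / 8883.75 ≈ 0.25, so retaining just ~25% of at-risk customers covers the cost.
```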